Introduction

New York City inspects its restaurants on a regular annual basis. The city comprises five boroughs: Manhattan, the Bronx, Brooklyn, Queens, and Staten Island. Inspectors visit restaurants in every borough to check compliance with city and state food safety regulations. If a restaurant violates any of these regulations, the inspector assigns points. The point ranges are: 0-13, which corresponds to a letter grade of A; 14-27, which corresponds to a B; and 28 or higher, which corresponds to a C. Generally, the fewer points a restaurant receives, the more hygienic it is with respect to New York’s food safety regulations. Points are assigned according to violation codes, which are composed of letters and numbers. Inspectors and the health department use these codes to describe concisely the violation that has occurred.

New York City has gathered a large dataset of restaurant inspections that is available to the public. The inspection dates range from 2013 to 2017. The dataset can be found on www.nyc.gov and includes specific information about each restaurant, such as zip code, building number, phone number, and street number. Each restaurant is also labelled with the style of cooking associated with a particular region, for example Mexican, Chinese, or Latin. This column, called cuisine in the original data, is a variable of particular interest to our group.
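The mapping from points to letter grades can be sketched in a few lines of R. This is a simplified illustration that assumes the cut-offs 0-13 (A), 14-27 (B), and 28 or more (C); `score_to_grade` is a hypothetical helper name and is not part of the analysis that follows.

```r
# Map inspection scores to letter grades, assuming the cut-offs
# 0-13 -> A, 14-27 -> B, 28+ -> C (hypothetical helper, for illustration only).
score_to_grade <- function(score) {
  cut(score,
      breaks = c(-Inf, 13, 27, Inf),  # intervals (-Inf,13], (13,27], (27,Inf)
      labels = c("A", "B", "C"))
}

example_scores <- c(0, 13, 14, 27, 28, 45)
grades <- score_to_grade(example_scores)
print(data.frame(score = example_scores, grade = grades))
```

Using `cut()` keeps the boundaries in one place, so the thresholds can be changed without rewriting nested `ifelse()` calls.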

For our analysis, we focus mainly on four variables: cuisine, inspection date, borough, and score. The dataset was too large to analyze every cuisine, so we narrowed our focus to the top five cuisines (American, Chinese, Mexican, Italian, and Japanese). This led to our main question: “Where is the ideal location for running a restaurant serving one of the five most popular cuisines in New York City?” With the main question established, we asked whether cuisine is related to location, whether a borough has more restaurants of a specific cuisine, and whether violation codes are related to cuisine. Finally, we wanted to see whether a cuisine receives a specific violation code more often than the others.

Ethical Consideration

Our dataset is open data provided by New York City, so we are free to publish our findings. If the dataset were private, it would be unethical to publish without consent. Even so, publishing our findings could affect New York City’s residents and restaurant owners, because it may bias residents when they choose where to eat. For example, if we find a lower average score for American restaurants (a lower score results from fewer violation points) and a higher one for Chinese restaurants, residents may favor American restaurants over Chinese ones. Owners of Chinese restaurants would then be more exposed to criticism and a decline in business. A higher score may suggest that a restaurant complies less well with the regulations, but it should not be read as meaning the restaurant is dirty, since the violations are not only about sanitation. To sum up, it is important to read the analysis without assumptions, especially about the quality of the restaurants themselves.

Data Exploration

Finding top 5 cuisines in NYC

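# Packages used by the chunks in this report (loading of NYC_Data is not shown here)
library(dplyr)     # %>%, filter, select, group_by, summarise
library(knitr)     # kable
library(leaflet)   # interactive map
library(ggplot2)   # plots
library(ggthemes)  # theme_tufte

# finding top 5 cuisines: tabulate the cuisine column and sort by frequency
TopCuisines <- sort(table(NYC_Data$Cuisine), decreasing = TRUE)
# -> top 5 cuisines: American, Chinese, Italian, Japanese, Mexican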

Cuisines <- NYC_Data %>%
  filter(!is.na(SCORE)) %>%
  dplyr::select(BORO, Cuisine, ViolationCode, ViolationDescription,
           InspectionDate, Latitude, Longitude, SCORE, ZIPCODE, State, County) %>%
  filter(Cuisine == "American" | Cuisine == "Chinese"|
           Cuisine == "Italian" | Cuisine == "Japanese" | Cuisine == "Mexican",
         Latitude != 0,
         Longitude != 0,
         County == "Richmond County" | County == "New York County" | County == "Bronx County" |
           County == "Kings County" | County == "Queens County")

Cuisines$Cuisine <- as.numeric(
  as.character(
    factor(
      Cuisines$Cuisine,
      levels = c("American", "Chinese", "Italian", "Japanese", "Mexican"),
      labels = c("1", "2", "3", "4", "5"))))

# Count records per cuisine code; every remaining row belongs to one of the
# five cuisines, so the group size n() is the frequency
Cuisines_frq <- Cuisines %>%
  group_by(Cuisine) %>%
  summarise(Frequency = n())

Cuisines_frq$Cuisine <- factor(Cuisines_frq$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))

Cuisines_frq <- Cuisines_frq[order(-Cuisines_frq$Frequency) , ]

kable(Cuisines_frq, caption = "Top 5 Cuisines of New York City's Restaurants")

Top 5 Cuisines of New York City’s Restaurants

| Cuisine  | Frequency |
|:---------|----------:|
| American |     83788 |
| Chinese  |     39389 |
| Italian  |     16504 |
| Mexican  |     14292 |
| Japanese |     13540 |

The top 5 cuisines in New York City are American (83,788), Chinese (39,389), Italian (16,504), Mexican (14,292), and Japanese (13,540). American is by far the most common, with more than twice as many records as Chinese, the second most popular cuisine in New York City. Note that these frequencies count rows of inspection data rather than distinct restaurants, so a restaurant that was inspected several times is counted more than once.
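The step from a frequency table to "the top five" is not shown in the code above. A minimal sketch of how it could be done in base R follows; the vector `cuisine` is made-up toy data standing in for `NYC_Data$Cuisine`, and `top_n_levels` is a hypothetical helper name.

```r
# Toy stand-in for NYC_Data$Cuisine; the values are invented for illustration.
cuisine <- c("American", "Chinese", "American", "Italian", "Mexican",
             "American", "Chinese", "Japanese", "Thai", "Chinese",
             "Italian", "American")

# Tabulate, sort from most to least frequent, and keep the first n entries.
top_n_levels <- function(x, n = 5) {
  counts <- sort(table(x), decreasing = TRUE)
  counts[seq_len(min(n, length(counts)))]
}

top3 <- top_n_levels(cuisine, n = 3)
print(top3)
```

The names of the result (`names(top3)`) could then be passed to `filter(Cuisine %in% ...)` instead of writing out each cuisine by hand.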

Maps

Cuisines_Dummy <- NYC_Data %>%
  filter(!is.na(SCORE)) %>%
  dplyr::select(BORO, Cuisine, ViolationCode, ViolationDescription,
           InspectionDate, Latitude, Longitude, SCORE, ZIPCODE, State, County) %>%
  filter(Cuisine == "American" | Cuisine == "Chinese"|
           Cuisine == "Italian" | Cuisine == "Japanese" | Cuisine == "Mexican",
         Latitude != 0,
         Longitude != 0,
         County == "Richmond County" | County == "New York County" | County == "Bronx County" |
           County == "Kings County" | County == "Queens County")

# One colour per cuisine; the domain only needs the distinct cuisine values
pal <- colorFactor(
  palette = c('red', 'orange', 'sky blue', 'yellow', 'dark green'),
  domain = unique(Cuisines_Dummy$Cuisine)
)

# Clustered circle markers, centred on midtown Manhattan
NewYork <- leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = Cuisines_Dummy, lng = ~Longitude, lat = ~Latitude,
                   radius = 3, color = ~pal(Cuisine),
                   clusterOptions = markerClusterOptions()) %>%
  addLegend(pal = pal, values = Cuisines_Dummy$Cuisine,
            title = "Cuisine") %>%
  setView(-73.98513, 40.7589, zoom = 13)

NewYork

Finding top 5 violation codes of NYC restaurants

# Finding top 5 violation codes
TopViolationCode <- table(Cuisines$ViolationCode)

Violation_frq <- Cuisines %>%
  group_by(ViolationCode, ViolationDescription) %>%
  dplyr::select(ViolationCode, ViolationDescription) %>%
  filter(ViolationCode == "10F" | ViolationCode == "08A" | ViolationCode == "06D" |
         ViolationCode == "02G" | ViolationCode == "06C") %>%
  summarise(Frequency = sum(ViolationCode == "10F", ViolationCode == "08A", ViolationCode == "06D",
         ViolationCode == "02G", ViolationCode == "06C"))

Violation_frq <- Violation_frq[order(-Violation_frq$Frequency) , ]
kable(Violation_frq, caption = "Top 5 violation codes for the five most popular cuisines in NYC")

Top 5 violation codes for the five most popular cuisines in NYC

| ViolationCode | ViolationDescription | Frequency |
|:--------------|:---------------------|----------:|
| 10F | Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit. | 24486 |
| 08A | Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist. | 17499 |
| 06D | Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred. | 12639 |
| 02G | Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation. | 12467 |
| 06C | Food not protected from potential source of contamination during storage, preparation, transportation, display or service. | 12299 |

Stacked bar plot for top 5 violation codes in each borough and each cuisine

Before going deeper, we want to see which violation codes are most common in each borough. The tables below show the count of each of the top 5 violation codes and its percentage of the top-five total within each borough. Overall, 10F is the most common violation code in every borough. The second most common code is 08A in every borough except Staten Island, where it is 06D (08A ties with 02G for third place there). However, since Staten Island has far fewer records than the other boroughs, it is reasonable to say that the two most common violation codes across New York City are 10F and 08A.

Manhattan_Viocode <- Cuisines %>%
  group_by(BORO, ViolationCode) %>%
  filter(BORO == "MANHATTAN",
         ViolationCode == "10F" | ViolationCode == "08A" |
                        ViolationCode == "06D" | ViolationCode == "02G" |
                        ViolationCode == "06C") %>%
  summarise(Count = sum(ViolationCode == "10F", ViolationCode == "08A",
                        ViolationCode == "06D", ViolationCode == "02G",
                        ViolationCode == "06C"),
            Percentage = (Count / (10930 + 7729 + 6201 + 6081 + 5631))*100)
Manhattan_Viocode <- Manhattan_Viocode[order(-Manhattan_Viocode$Count) , ]
kable(Manhattan_Viocode, caption = "Manhattan")

Manhattan

| BORO      | ViolationCode | Count | Percentage |
|:----------|:--------------|------:|-----------:|
| MANHATTAN | 10F           | 10930 |   29.88625 |
| MANHATTAN | 08A           |  7729 |   21.13365 |
| MANHATTAN | 06D           |  6201 |   16.95559 |
| MANHATTAN | 02G           |  6081 |   16.62747 |
| MANHATTAN | 06C           |  5631 |   15.39702 |

Bronx_Viocode <- Cuisines %>%
  group_by(BORO, ViolationCode) %>%
  filter(BORO == "BRONX",
         ViolationCode == "10F" | ViolationCode == "08A" |
                        ViolationCode == "06D" | ViolationCode == "02G" |
                        ViolationCode == "06C") %>%
  summarise(Count = sum(ViolationCode == "10F", ViolationCode == "08A",
                        ViolationCode == "06D", ViolationCode == "02G",
                        ViolationCode == "06C"),
            Percentage = (Count / (1887 + 1515 + 873 + 840 + 787)) *100 )
Bronx_Viocode <- Bronx_Viocode[order(-Bronx_Viocode$Count) , ]
kable(Bronx_Viocode,caption = "Bronx")

Bronx

| BORO  | ViolationCode | Count | Percentage |
|:------|:--------------|------:|-----------:|
| BRONX | 10F           |  1887 |   31.97221 |
| BRONX | 08A           |  1515 |   25.66926 |
| BRONX | 06C           |   873 |   14.79160 |
| BRONX | 02G           |   840 |   14.23246 |
| BRONX | 06D           |   787 |   13.33446 |

Brooklyn_Viocode <- Cuisines %>%
  group_by(BORO, ViolationCode) %>%
  filter(BORO == "BROOKLYN",
         ViolationCode == "10F" | ViolationCode == "08A" |
                        ViolationCode == "06D" | ViolationCode == "02G" |
                        ViolationCode == "06C") %>%
  summarise(Count = sum(ViolationCode == "10F", ViolationCode == "08A",
                        ViolationCode == "06D", ViolationCode == "02G",
                        ViolationCode == "06C"),
            Percentage = (Count / (5990 + 4450 + 2959 + 2811 + 2784) *100))
Brooklyn_Viocode <- Brooklyn_Viocode[order(-Brooklyn_Viocode$Count) , ]
kable(Brooklyn_Viocode,caption = "Brooklyn")

Brooklyn

| BORO     | ViolationCode | Count | Percentage |
|:---------|:--------------|------:|-----------:|
| BROOKLYN | 10F           |  5990 |   31.53627 |
| BROOKLYN | 08A           |  4450 |   23.42845 |
| BROOKLYN | 06C           |  2959 |   15.57860 |
| BROOKLYN | 06D           |  2811 |   14.79941 |
| BROOKLYN | 02G           |  2784 |   14.65726 |

StatenIsland_Viocode <- Cuisines %>%
  group_by(BORO, ViolationCode) %>%
  filter(BORO == "STATEN ISLAND",
         ViolationCode == "10F" | ViolationCode == "08A" |
                        ViolationCode == "06D" | ViolationCode == "02G" |
                        ViolationCode == "06C") %>%
  summarise(Count = sum(ViolationCode == "10F", ViolationCode == "08A",
                        ViolationCode == "06D", ViolationCode == "02G",
                        ViolationCode == "06C"),
            Percentage =(Count / (942 + 627 + 561 + 561 + 477))*100)
StatenIsland_Viocode <- StatenIsland_Viocode[order(-StatenIsland_Viocode$Count) , ]
kable(StatenIsland_Viocode,caption = "Staten Island")

Staten Island

| BORO          | ViolationCode | Count | Percentage |
|:--------------|:--------------|------:|-----------:|
| STATEN ISLAND | 10F           |   942 |   29.73485 |
| STATEN ISLAND | 06D           |   627 |   19.79167 |
| STATEN ISLAND | 02G           |   561 |   17.70833 |
| STATEN ISLAND | 08A           |   561 |   17.70833 |
| STATEN ISLAND | 06C           |   477 |   15.05682 |

Queens_Viocode <- Cuisines %>%
  group_by(BORO, ViolationCode) %>%
  filter(BORO == "QUEENS",
         ViolationCode == "10F" | ViolationCode == "08A" |
                        ViolationCode == "06D" | ViolationCode == "02G" |
                        ViolationCode == "06C") %>%
  summarise(Count = sum(ViolationCode == "10F", ViolationCode == "08A",
                        ViolationCode == "06D", ViolationCode == "02G",
                        ViolationCode == "06C"),
            Percentage = (Count / (4737 + 3244 + 2359 + 2213 + 2201)*100) )
Queens_Viocode <- Queens_Viocode[order(-Queens_Viocode$Count) , ]
kable(Queens_Viocode, caption = "Queens")

Queens

| BORO   | ViolationCode | Count | Percentage |
|:-------|:--------------|------:|-----------:|
| QUEENS | 10F           |  4737 |   32.10655 |
| QUEENS | 08A           |  3244 |   21.98726 |
| QUEENS | 06C           |  2359 |   15.98888 |
| QUEENS | 06D           |  2213 |   14.99932 |
| QUEENS | 02G           |  2201 |   14.91799 |

Viocode_ggplot_boro <- rbind(Manhattan_Viocode, Bronx_Viocode, Brooklyn_Viocode, StatenIsland_Viocode, Queens_Viocode)
Viocode_ggplot_boro$Percentage <- round(Viocode_ggplot_boro$Percentage, 3)

ggplot(Viocode_ggplot_boro, aes( x = BORO, y = Percentage, fill = ViolationCode)) +
   geom_bar(position = position_stack(), stat = "identity", width = .7) +
  geom_text(aes(label = Percentage), position = position_stack(vjust = 0.5), size = 2.5) +
  scale_fill_manual(name="Violation Code", values = c("salmon", "dark green", "sky blue", "purple", "coral")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Percentage of Violation Codes by Borough", x = "Borough")

The bar plot shows the distribution of the top five violation codes in New York City restaurant inspections. All five boroughs show a similar distribution, with 10F being the most common code in each. However, the least common code (out of the five that are filtered) differs by borough: in the Bronx it is 06D; in Brooklyn and Queens it is 02G; in Manhattan and Staten Island it is 06C. The three least common codes in each borough are within roughly 1 to 3 percentage points of one another, so these differences are small.
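The percentages in the borough tables above are computed with hard-coded totals typed in by hand, which is easy to get wrong if the data changes. As a hedged alternative, the same within-borough shares can be derived directly from a cross-tabulation. The sketch below uses base R and a small invented data frame `records`; it is not the code that produced the tables.

```r
# Toy violation records; boroughs and codes are illustrative, not the real counts.
records <- data.frame(
  BORO = c("BRONX", "BRONX", "BRONX", "BRONX",
           "QUEENS", "QUEENS", "QUEENS", "QUEENS", "QUEENS"),
  ViolationCode = c("10F", "10F", "08A", "06D",
                    "10F", "08A", "08A", "06C", "06C")
)

# Cross-tabulate borough by code, then convert each row (borough) to percentages.
counts <- table(records$BORO, records$ViolationCode)
shares <- prop.table(counts, margin = 1) * 100

print(round(shares, 1))
```

With `margin = 1` each row sums to 100, so the denominators never need to be entered manually.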

Top 5 Violation codes by Cuisines

Next, we want to see which violation codes are most common in each cuisine. The tables below show the count of each of the top 5 violation codes and its percentage of the top-five total within each cuisine. Overall, 10F is again the most common violation code for every cuisine, and 08A is the second most common. This further confirms that the two most common violation codes across New York City are 10F and 08A.

American_Viocode <- Cuisines %>%
  group_by(Cuisine, ViolationCode) %>%
  filter(Cuisine == "1",
         ViolationCode == "10F" | ViolationCode == "08A" |
                        ViolationCode == "06D" | ViolationCode == "02G" |
                        ViolationCode == "06C") %>%
  summarise(Count = sum(ViolationCode == "10F", ViolationCode == "08A",
                        ViolationCode == "06D", ViolationCode == "02G",
                        ViolationCode == "06C"),
            Percentage = (Count / (12821 + 8572 + 7442 + 6087 + 5571)) * 100 )
American_Viocode <- American_Viocode[order(-American_Viocode$Count) , ]

American_Viocode$Cuisine <- factor(American_Viocode$Cuisine, levels = c(1),
                             labels = c("American"))
kable(American_Viocode,caption = "American")

American

| Cuisine  | ViolationCode | Count | Percentage |
|:---------|:--------------|------:|-----------:|
| American | 10F           | 12821 |   31.66226 |
| American | 08A           |  8572 |   21.16909 |
| American | 06D           |  7442 |   18.37849 |
| American | 02G           |  6087 |   15.03223 |
| American | 06C           |  5571 |   13.75793 |

Chinese_Viocode <- Cuisines %>%
  group_by(Cuisine, ViolationCode) %>%
  filter(Cuisine == "2",
         ViolationCode == "10F" | ViolationCode == "08A" |
                        ViolationCode == "06D" | ViolationCode == "02G" |
                        ViolationCode == "06C") %>%
  summarise(Count = sum(ViolationCode == "10F", ViolationCode == "08A",
                        ViolationCode == "06D", ViolationCode == "02G",
                        ViolationCode == "06C"),
            Percentage = (Count / (5666 + 4343 + 3375 + 3094 + 2085)) * 100 )
Chinese_Viocode <- Chinese_Viocode[order(-Chinese_Viocode$Count) , ]

Chinese_Viocode$Cuisine <- factor(Chinese_Viocode$Cuisine, levels = c(2),
                             labels = c("Chinese"))

kable(Chinese_Viocode,caption = "Chinese")

Chinese

| Cuisine | ViolationCode | Count | Percentage |
|:--------|:--------------|------:|-----------:|
| Chinese | 10F           |  5666 |   30.52308 |
| Chinese | 08A           |  4343 |   23.39600 |
| Chinese | 06C           |  3375 |   18.18133 |
| Chinese | 02G           |  3094 |   16.66756 |
| Chinese | 06D           |  2085 |   11.23202 |

Italian_Viocode <- Cuisines %>%
  group_by(Cuisine, ViolationCode) %>%
  filter(Cuisine == "3",
         ViolationCode == "10F" | ViolationCode == "08A" |
                        ViolationCode == "06D" | ViolationCode == "02G" |
                        ViolationCode == "06C") %>%
  summarise(Count = sum(ViolationCode == "10F", ViolationCode == "08A",
                        ViolationCode == "06D", ViolationCode == "02G",
                        ViolationCode == "06C"),
            Percentage = (Count / (2260 + 1600 + 1515 + 1315 + 1276)) * 100 )
Italian_Viocode <- Italian_Viocode[order(-Italian_Viocode$Count) , ]

Italian_Viocode$Cuisine <- factor(Italian_Viocode$Cuisine, levels = c(3),
                             labels = c("Italian"))

kable(Italian_Viocode,caption = "Italian")

Italian

| Cuisine | ViolationCode | Count | Percentage |
|:--------|:--------------|------:|-----------:|
| Italian | 10F           |  2260 |   28.37057 |
| Italian | 08A           |  1600 |   20.08536 |
| Italian | 06D           |  1515 |   19.01833 |
| Italian | 06C           |  1315 |   16.50766 |
| Italian | 02G           |  1276 |   16.01808 |

Japanese_Viocode <- Cuisines %>%
  group_by(Cuisine, ViolationCode) %>%
  filter(Cuisine == "4",
         ViolationCode == "10F" | ViolationCode == "08A" |
                        ViolationCode == "06D" | ViolationCode == "02G" |
                        ViolationCode == "06C") %>%
  summarise(Count = sum(ViolationCode == "10F", ViolationCode == "08A",
                        ViolationCode == "06D", ViolationCode == "02G",
                        ViolationCode == "06C"),
            Percentage = (Count / (1860 + 1364 + 1056 + 958 + 812)) * 100 )
Japanese_Viocode <- Japanese_Viocode[order(-Japanese_Viocode$Count) , ]

Japanese_Viocode$Cuisine <- factor(Japanese_Viocode$Cuisine, levels = c(4),
                             labels = c("Japanese"))

kable(Japanese_Viocode,caption = "Japanese")

Japanese

| Cuisine  | ViolationCode | Count | Percentage |
|:---------|:--------------|------:|-----------:|
| Japanese | 10F           |  1860 |   30.74380 |
| Japanese | 08A           |  1364 |   22.54545 |
| Japanese | 06C           |  1056 |   17.45455 |
| Japanese | 02G           |   958 |   15.83471 |
| Japanese | 06D           |   812 |   13.42149 |

Mexican_Viocode <- Cuisines %>%
  group_by(Cuisine, ViolationCode) %>%
  filter(Cuisine == "5",
         ViolationCode == "10F" | ViolationCode == "08A" |
                        ViolationCode == "06D" | ViolationCode == "02G" |
                        ViolationCode == "06C") %>%
  summarise(Count = sum(ViolationCode == "10F", ViolationCode == "08A",
                        ViolationCode == "06D", ViolationCode == "02G",
                        ViolationCode == "06C"),
            Percentage = ( Count / (1879 + 1620 + 1052 + 982 + 785)) * 100 )
Mexican_Viocode <- Mexican_Viocode[order(-Mexican_Viocode$Count) , ]

Mexican_Viocode$Cuisine <- factor(Mexican_Viocode$Cuisine, levels = c(5),
                             labels = c("Mexican"))

kable(Mexican_Viocode,caption = "Mexican")

Mexican

| Cuisine | ViolationCode | Count | Percentage |
|:--------|:--------------|------:|-----------:|
| Mexican | 10F           |  1879 |   29.74042 |
| Mexican | 08A           |  1620 |   25.64103 |
| Mexican | 02G           |  1052 |   16.65084 |
| Mexican | 06C           |   982 |   15.54289 |
| Mexican | 06D           |   785 |   12.42482 |

Viocode_ggplot_cuisine <- rbind(American_Viocode, Chinese_Viocode, Italian_Viocode,
                                Japanese_Viocode, Mexican_Viocode)
Viocode_ggplot_cuisine$Percentage <- round(Viocode_ggplot_cuisine$Percentage, 3)

ggplot(Viocode_ggplot_cuisine, aes( x = Cuisine, y = Percentage, fill = ViolationCode)) +
   geom_bar(position = position_stack(), stat = "identity", width = .7) +
  geom_text(aes(label = Percentage), position = position_stack(vjust = 0.5), size = 2.5) +
  scale_fill_manual(name="Violation Code", values = c("salmon", "dark green", "sky blue", "purple", "coral")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Percentage of Violation Codes by Cuisine", x = "Cuisine")

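The bar plot shows the distribution of the top five violation codes within each cuisine. In every cuisine 10F is the most common code (about 28% to 32%) and 08A the second most common (about 20% to 26%). The least common code differs: it is 06C for American, 02G for Italian, and 06D for Chinese, Japanese, and Mexican. Turning to how the cuisines themselves are distributed across the boroughs, American makes up the largest share in every borough. The second most popular cuisine in the Bronx, Brooklyn, Queens, and Manhattan is Chinese, while in Staten Island it is Italian. Most Chinese restaurants are in Queens, most American and Japanese restaurants are in Manhattan, and Italian restaurants are relatively most prominent in Staten Island.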

Average scores of the boroughs

AvgScores_AllBoro <- NYC_Data %>% 
  filter(!is.na(SCORE)) %>%
  dplyr::select(BORO, SCORE) %>%
  filter(BORO != "Missing") %>%
  group_by(BORO) %>%
  summarise(AverageScore = mean(SCORE))
AvgScores_AllBoro <- AvgScores_AllBoro[order(AvgScores_AllBoro$AverageScore) , ]

kable(AvgScores_AllBoro)

| BORO          | AverageScore |
|:--------------|-------------:|
| BRONX         |     18.18633 |
| QUEENS        |     18.68833 |
| MANHATTAN     |     19.00246 |
| BROOKLYN      |     19.32913 |
| STATEN ISLAND |     19.56530 |

The table shows that the borough with the lowest (best) average score is the Bronx, at 18.19, and the borough with the highest (worst) average score is Staten Island, at 19.57. The spread across all five boroughs is less than 1.4 points.

Numbers of restaurants by cuisines in NYC and average scores (within top 5 cuisines)

Cuisines_AllBoro <- Cuisines %>%
  group_by(Cuisine) %>%
  summarise(Count = sum(Cuisine == "1", Cuisine == "2",
                        Cuisine == "3", Cuisine == "4",
                        Cuisine == "5"),
            AverageScore = mean(SCORE, na.rm = TRUE))

Cuisines_AllBoro$Cuisine <- factor(Cuisines_AllBoro$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))
Cuisines_AllBoro <- Cuisines_AllBoro[order(Cuisines_AllBoro$AverageScore) , ]
kable(Cuisines_AllBoro,
      caption = "Number of restaurants and average scores for the top 5 cuisines in NYC")

Number of restaurants and average scores for the top 5 cuisines in NYC

| Cuisine  | Count | AverageScore |
|:---------|------:|-------------:|
| American | 83788 |     18.09840 |
| Italian  | 16504 |     18.61270 |
| Japanese | 13540 |     19.62194 |
| Mexican  | 14292 |     20.00931 |
| Chinese  | 39389 |     20.44167 |

The table shows the number of restaurants for each cuisine and their average scores. Since a lower score means fewer violation points (as stated in the New York City inspection guides), American has the best average score and Chinese the worst. The gap between them is only about 2.3 points, however, so more tests are needed before concluding that the cuisines really differ. These tests are carried out in the later parts of the report.

Average scores of restaurants by cuisines in different boroughs

American_AvgScore <- Cuisines %>%
  dplyr::select(Cuisine,BORO, SCORE)%>%
  group_by(BORO) %>%
  filter(Cuisine == "1") %>%
  summarise(AvgScore = mean(SCORE) )

American_AvgScore <- American_AvgScore[order(American_AvgScore$AvgScore) , ]

kable(American_AvgScore,caption = "American")

American

| BORO          | AvgScore |
|:--------------|---------:|
| QUEENS        | 17.13697 |
| BRONX         | 17.69871 |
| BROOKLYN      | 18.18839 |
| MANHATTAN     | 18.30010 |
| STATEN ISLAND | 19.65568 |

Chinese_AvgScore <- Cuisines %>%
  dplyr::select(Cuisine,BORO, SCORE)%>%
  group_by(BORO) %>%
  filter(Cuisine == "2") %>%
  summarise(AvgScore = mean(SCORE) )

Chinese_AvgScore <- Chinese_AvgScore[order(Chinese_AvgScore$AvgScore) , ]

kable(Chinese_AvgScore,caption = "Chinese")

Chinese

| BORO          | AvgScore |
|:--------------|---------:|
| BRONX         | 16.10982 |
| STATEN ISLAND | 18.49319 |
| QUEENS        | 20.05533 |
| BROOKLYN      | 20.11389 |
| MANHATTAN     | 23.14858 |

Italian_AvgScore <- Cuisines %>%
  dplyr::select(Cuisine,BORO, SCORE)%>%
  group_by(BORO) %>%
  filter(Cuisine == "3") %>%
  summarise(AvgScore = mean(SCORE) )

Italian_AvgScore <- Italian_AvgScore[order(Italian_AvgScore$AvgScore) , ]

kable(Italian_AvgScore,caption = "Italian")

Italian

| BORO          | AvgScore |
|:--------------|---------:|
| QUEENS        | 17.54399 |
| BROOKLYN      | 18.27536 |
| MANHATTAN     | 18.65371 |
| BRONX         | 19.66931 |
| STATEN ISLAND | 20.04337 |

Japanese_AvgScore <- Cuisines %>%
  dplyr::select(Cuisine,BORO, SCORE)%>%
  group_by(BORO) %>%
  filter(Cuisine == "4") %>%
  summarise(AvgScore = mean(SCORE) )

Japanese_AvgScore <- Japanese_AvgScore[order(Japanese_AvgScore$AvgScore) , ]

kable(Japanese_AvgScore,caption = "Japanese")

Japanese

| BORO          | AvgScore |
|:--------------|---------:|
| BRONX         | 15.24812 |
| QUEENS        | 17.41325 |
| MANHATTAN     | 20.02893 |
| BROOKLYN      | 20.28324 |
| STATEN ISLAND | 20.41486 |

Mexican_AvgScore <- Cuisines %>%
  dplyr::select(Cuisine,BORO, SCORE)%>%
  group_by(BORO) %>%
  filter(Cuisine == "5") %>%
  summarise(AvgScore = mean(SCORE) )

Mexican_AvgScore <- Mexican_AvgScore[order(Mexican_AvgScore$AvgScore) , ]

kable(Mexican_AvgScore,caption = "Mexican")

Mexican

| BORO          | AvgScore |
|:--------------|---------:|
| QUEENS        | 19.19818 |
| MANHATTAN     | 19.92573 |
| BRONX         | 20.00174 |
| STATEN ISLAND | 20.10929 |
| BROOKLYN      | 20.67153 |

Comparing the five tables above, Queens appears to have the best-scoring American, Italian, and Mexican restaurants, because its average scores for those cuisines are the lowest (a lower score is better). The best-scoring Chinese and Japanese restaurants are found in the Bronx. Two figures stand out: the average score of Chinese restaurants in Manhattan is 23.15, much higher than that of any other cuisine in any borough, and the lowest average score is 15.25, for Japanese restaurants in the Bronx.
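The five per-cuisine tables above can also be produced as a single borough-by-cuisine grid, which makes this kind of comparison easier to read. The following is a minimal base-R sketch on invented data; `toy` and its values are assumptions for illustration and do not reproduce the figures above.

```r
# Toy inspection rows; values are invented for illustration.
toy <- data.frame(
  BORO    = c("BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS", "BRONX"),
  Cuisine = c("Chinese", "Chinese", "Chinese", "Italian", "Italian", "Italian"),
  SCORE   = c(10, 20, 30, 12, 18, 25)
)

# Rows are boroughs, columns are cuisines, cells are mean scores.
avg_scores <- tapply(toy$SCORE, list(toy$BORO, toy$Cuisine), mean)
print(avg_scores)

# For each cuisine (column), the borough with the lowest (best) average score.
best_boro <- rownames(avg_scores)[apply(avg_scores, 2, which.min)]
names(best_boro) <- colnames(avg_scores)
print(best_boro)
```

Applied to the real data, the same two steps would list the best-scoring borough for each cuisine without reading five separate tables.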

Manhattan Data

Manhattan_Cuisines <- Cuisines %>%
  group_by(BORO, Cuisine) %>%
  filter(BORO == "MANHATTAN") %>%
  summarise(Count = sum(Cuisine == "1", Cuisine == "2",
                        Cuisine == "3", Cuisine == "4",
                        Cuisine == "5"),
            AverageScore = mean(SCORE, na.rm = TRUE),
            Percentage = ( Count / (43622+10526+9599+7917+4874) ) * 100 ) 
Manhattan_Cuisines$Cuisine <- factor(Manhattan_Cuisines$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))
Manhattan_Cuisines <- Manhattan_Cuisines[order(-Manhattan_Cuisines$Count) , ]
kable(Manhattan_Cuisines, caption = "Manhattan")

Manhattan

| BORO      | Cuisine  | Count | AverageScore | Percentage |
|:----------|:---------|------:|-------------:|-----------:|
| MANHATTAN | American | 43622 |     18.30010 |  56.993912 |
| MANHATTAN | Chinese  | 10526 |     23.14858 |  13.752646 |
| MANHATTAN | Italian  |  9599 |     18.65371 |  12.541483 |
| MANHATTAN | Japanese |  7917 |     20.02893 |  10.343882 |
| MANHATTAN | Mexican  |  4874 |     19.92573 |   6.368079 |

The table shows that the most popular cuisine in Manhattan is American, accounting for approximately 57% of the records for the five cuisines in the borough. The other cuisines follow, with Mexican being the least popular in Manhattan (about 6.37%). American cuisine also has the lowest average score (18.3), which means that it is the best compared to the other cuisines. The worst (highest) average score belongs to Chinese cuisine, at 23.15.

Manhattan residuals

#filtering Manhattan data residuals
Manhattan_Residuals <- Cuisines %>%
  filter(BORO == "MANHATTAN")

#linear model
Manhattan_Cuisines_mod <- lm(data = Manhattan_Residuals, SCORE ~ Cuisine)
summary(Manhattan_Cuisines_mod)
## 
## Call:
## lm(formula = SCORE ~ Cuisine, data = Manhattan_Residuals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.218  -8.766  -3.863   5.234  96.330 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.41126    0.08319  221.30   <2e-16 ***
## Cuisine      0.45169    0.03549   12.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.73 on 76536 degrees of freedom
## Multiple R-squared:  0.002112,   Adjusted R-squared:  0.002099 
## F-statistic:   162 on 1 and 76536 DF,  p-value: < 2.2e-16

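The linear model regresses score on cuisine, with cuisine entered as the numeric code 1 to 5 (American to Mexican). The estimated slope is positive (about 0.45 points per step in the code) and its p-value is below 0.05, so the slope is statistically distinguishable from zero. However, both the multiple and adjusted R-squared values are about 0.002, meaning that the cuisine code explains only around 0.2% of the variation in score. With more than 76,000 observations, even a negligible effect is statistically significant. In addition, the codes 1 to 5 are arbitrary labels rather than quantities, so the sign and size of the slope depend on the order in which the cuisines were coded. Thus, this model shows almost no useful linear relationship between cuisine and score.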
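One way to ask whether mean scores differ by cuisine without relying on numeric codes is to treat cuisine as a categorical variable. The sketch below contrasts the two codings on simulated data; the data frame `toy`, its means, and the three-cuisine setup are assumptions made purely for illustration.

```r
set.seed(1)
# Simulated scores for three cuisines coded 1..3; the numbers are invented.
toy <- data.frame(
  Cuisine = rep(c(1, 2, 3), each = 50),
  SCORE   = c(rnorm(50, mean = 18), rnorm(50, mean = 23), rnorm(50, mean = 19))
)

# Numeric coding: a single slope, which assumes score changes linearly with the code.
mod_numeric <- lm(SCORE ~ Cuisine, data = toy)

# Factor coding: an intercept for cuisine 1 plus one contrast per other cuisine.
mod_factor <- lm(SCORE ~ factor(Cuisine), data = toy)

print(coef(mod_numeric))
print(coef(mod_factor))
print(anova(mod_factor))  # F-test of whether mean score differs across cuisines
```

In the factor model each coefficient is a difference in mean score from the reference cuisine, which is directly interpretable, whereas the numeric slope changes if the cuisines are renumbered.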

#making Manhattan data's residuals table
Manhattan_Residuals <- Manhattan_Residuals %>%
  dplyr::select(Cuisine, SCORE) %>%
  mutate(residual = resid(Manhattan_Cuisines_mod))

Manhattan_Residuals$Cuisine <- factor(Manhattan_Residuals$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))

#histogram
ggplot(Manhattan_Residuals, aes(residual)) + 
  geom_histogram() +
  theme_tufte() +
  labs(x="Residuals", title="Residuals of Manhattan Restaurants Score")

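The histogram of the residuals shows that there are outliers in the scores given to restaurants in Manhattan. The distribution is strongly skewed to the right: the median residual is about -3.9, while the largest is 96.3. Since residuals average zero by construction, this means that most inspections score a few points below the fitted mean, while a small number of very high scores form a long right tail. The residuals are therefore far from normally distributed.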

#boxplot
ggplot(Manhattan_Residuals,aes(x=factor(Cuisine),y=SCORE, colour = Cuisine))+
  geom_boxplot(notch = TRUE) +
  labs(x="Cuisines", y = "Score", title = "Scores of restaurants in Manhattan by Cuisines") 

The boxplots of scores by cuisine in Manhattan show that American restaurants have many more outliers than Chinese restaurants. Part of this is expected, because American has roughly four times as many records (43,622 versus 10,526), and more records produce more points beyond the whiskers. Because extreme scores pull the mean upward, the average scores in the table above should be read with caution, and the medians shown in the boxplots are a more robust basis for comparison. The smaller number of outliers among Chinese restaurants may also suggest that their high average reflects generally higher scores rather than a handful of extreme inspections, but this needs to be checked with a formal test.

t-tests on top 5 cuisines in Manhattan

Manhattan_American <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "MANHATTAN", Cuisine == "1") %>%
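  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: mean score = 0. `mu = 0` is the default;
# `H0 = mu` is not a t.test() argument and was silently ignored.
t.test(Manhattan_American$SCORE, mu = 0, conf.level = 0.95)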
## 
##  One Sample t-test
## 
## data:  Manhattan_American$SCORE
## t = 314.83, df = 43621, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  18.18617 18.41403
## sample estimates:
## mean of x 
##   18.3001
Manhattan_Chinese <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "MANHATTAN", Cuisine == "2") %>%
  mutate(AverageScore = mean(SCORE))

# mu = 0 made explicit; `H0 = mu` is not a t.test() argument
t.test(Manhattan_Chinese$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Manhattan_Chinese$SCORE
## t = 163.99, df = 10525, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  22.87189 23.42528
## sample estimates:
## mean of x 
##  23.14858
Manhattan_Italian <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "MANHATTAN", Cuisine == "3") %>%
  mutate(AverageScore = mean(SCORE))

# mu = 0 made explicit; `H0 = mu` is not a t.test() argument
t.test(Manhattan_Italian$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Manhattan_Italian$SCORE
## t = 157.4, df = 9598, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  18.42140 18.88603
## sample estimates:
## mean of x 
##  18.65371
Manhattan_Japanese <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "MANHATTAN", Cuisine == "4") %>%
  mutate(AverageScore = mean(SCORE))

# mu = 0 made explicit; `H0 = mu` is not a t.test() argument
t.test(Manhattan_Japanese$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Manhattan_Japanese$SCORE
## t = 137.38, df = 7916, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.74312 20.31473
## sample estimates:
## mean of x 
##  20.02893
Manhattan_Mexican <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "MANHATTAN", Cuisine == "5") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Manhattan_Mexican$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Manhattan_Mexican$SCORE
## t = 99.726, df = 4873, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.53402 20.31743
## sample estimates:
## mean of x 
##  19.92573

The t-tests reject the null hypothesis that the true mean score is 0 for every cuisine, with p-values below 2.2e-16 (far below the 0.05 threshold). Since every sample mean lies between roughly 18 and 23, this rejection is expected; the more informative output is the 95 percent confidence interval for each mean. The interval is narrowest for American cuisine (about 0.23 points wide) and widest for Mexican cuisine (about 0.78 points wide), which follows the sample sizes: the more records a cuisine has, the more precisely its mean is estimated. Compared with the range of the scores discussed earlier, all of the intervals are very narrow. A narrow interval describes the precision of the mean, not the spread of individual scores, so it is consistent with the wide spread and the many outliers observed in the boxplots.
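To make the link between sample size and interval width concrete, the following sketch rebuilds a 95 percent confidence interval by hand and checks it against `t.test()`. The scores are simulated; the `rpois()` draws and all object names are assumptions for illustration, not the inspection data.

```r
# Simulated, hypothetical scores; not the inspection data
set.seed(1)
small_sample <- rpois(500, lambda = 19)
large_sample <- rpois(40000, lambda = 19)

# 95% CI for the mean: xbar +/- t_{0.975, n-1} * s / sqrt(n)
ci_by_hand <- function(x, level = 0.95) {
  n <- length(x)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

ci_small <- ci_by_hand(small_sample)
ci_large <- ci_by_hand(large_sample)

# The hand-built interval should agree with the one reported by t.test()
all.equal(unname(ci_small), as.numeric(t.test(small_sample)$conf.int))

# Interval width shrinks roughly with 1 / sqrt(n)
width_small <- unname(diff(ci_small))
width_large <- unname(diff(ci_large))
c(small = width_small, large = width_large)
```

The larger sample gives the narrower interval even though both samples come from the same distribution, which is the pattern seen across the cuisines.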

Queens Data

Queens_Cuisines <- Cuisines %>%
  filter(BORO == "QUEENS") %>%
  group_by(BORO, Cuisine) %>%
  # n() counts the rows in each borough/cuisine group
  summarise(Count = n(),
            AverageScore = mean(SCORE, na.rm = TRUE)) %>%
  # The result is still grouped by BORO, so sum(Count) is the borough total
  mutate(Percentage = Count / sum(Count) * 100)

Queens_Cuisines$Cuisine <- factor(Queens_Cuisines$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))
Queens_Cuisines <- Queens_Cuisines[order(-Queens_Cuisines$Count) , ]
kable(Queens_Cuisines, caption = "Queens")
Queens

BORO    Cuisine   Count  AverageScore  Percentage
QUEENS  American  13638  17.13697      43.019368
QUEENS  Chinese   11079  20.05533      34.947322
QUEENS  Mexican   2962   19.19818      9.343259
QUEENS  Italian   2046   17.54399      6.453851
QUEENS  Japanese  1977   17.41325      6.236200

The table shows that the most common cuisine in Queens is American, accounting for approximately 43% of the borough's inspection records among the top five cuisines. Chinese follows at about 35%, and Japanese is the least common at about 6.24%. American cuisine also has the lowest average score (17.14), which means that it performs best on inspections among the five cuisines. The highest (worst) average score belongs to Chinese cuisine, at 20.06.
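The percentages can be checked directly from the counts. The short sketch below recomputes them from the Queens counts in the table; the object names are illustrative.

```r
# Counts copied from the Queens table above
queens_counts <- c(American = 13638, Chinese = 11079, Mexican = 2962,
                   Italian = 2046, Japanese = 1977)

# Share of each cuisine among the five-cuisine records, in percent
queens_share <- queens_counts / sum(queens_counts) * 100
round(queens_share, 2)
```

Dividing by `sum(queens_counts)` rather than a typed-in total keeps the shares correct if the counts change.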

Queens residuals

Queens_Residuals <- Cuisines %>%
  filter(BORO == "QUEENS")

Queens_Cuisines_mod <- lm(data = Queens_Residuals, SCORE ~ Cuisine)
summary(Queens_Cuisines_mod)
## 
## Call:
## lm(formula = SCORE ~ Cuisine, data = Queens_Residuals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.494  -8.006  -5.006   4.994  86.622 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.63434    0.13513 130.500  < 2e-16 ***
## Cuisine      0.37198    0.05639   6.597 4.27e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.63 on 31700 degrees of freedom
## Multiple R-squared:  0.001371,   Adjusted R-squared:  0.001339 
## F-statistic: 43.51 on 1 and 31700 DF,  p-value: 4.273e-11

The linear model shows a weak positive association between Cuisine and Score. The slope is positive (0.37) and its p-value is far below 0.05, so it is statistically distinguishable from zero; however, both the multiple and adjusted R-squared values are about 0.001, which means that cuisine explains almost none of the variation in score. The slope should also be read with caution, because Cuisine enters the model as a numeric code from 1 to 5 whose ordering is arbitrary. Thus, in practical terms, there is almost no relationship between cuisine and score.
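Because the cuisine codes are labels rather than quantities, one possible alternative is to enter Cuisine as a factor. The sketch below contrasts the two specifications on simulated data; the data frame `demo`, the group means, and the model names are assumptions for illustration and do not reproduce the fitted models above.

```r
# Simulated, hypothetical data: five cuisine codes with arbitrary group means
set.seed(2)
demo <- data.frame(Cuisine = rep(1:5, each = 200))
group_means <- c(17, 20, 18, 17, 19)
demo$SCORE <- rpois(nrow(demo), lambda = group_means[demo$Cuisine])

# Numeric code: a single slope, which assumes scores change linearly with the code
mod_numeric <- lm(SCORE ~ Cuisine, data = demo)

# Factor: one coefficient per cuisine, relative to the baseline (code 1)
mod_factor <- lm(SCORE ~ factor(Cuisine), data = demo)

length(coef(mod_numeric))  # intercept + 1 slope
length(coef(mod_factor))   # intercept + 4 cuisine contrasts

# The factor model fits at least as well, since it contains the linear one
summary(mod_factor)$r.squared >= summary(mod_numeric)$r.squared
```

With the factor specification, each coefficient is the difference in mean score between one cuisine and the baseline, which does not depend on how the cuisines happen to be numbered.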

Queens_Residuals <- Queens_Residuals %>%
  dplyr::select(Cuisine, SCORE) %>%
  mutate(residual = resid(Queens_Cuisines_mod))

Queens_Residuals$Cuisine <- factor(Queens_Residuals$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))
ggplot(Queens_Residuals, aes(residual)) + 
  geom_histogram() +
  theme_tufte() +
  labs(x="Residuals", title = "Residuals of Queens restaurants score")

The histogram of the residuals shows that there are outliers in the scores given to restaurants in Queens. Much like Manhattan's, it is skewed to the right: a small number of very high scores pull the mean above the median (the median residual is about -5). As a result, the average score overstates the score of a typical restaurant, which limits how well the averages represent each borough.
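The effect of right skew on the mean can be seen in a small simulation. This is a sketch on hypothetical scores drawn with `rgamma()`, an assumed distribution chosen only because it is right-skewed.

```r
set.seed(3)
# Simulated right-skewed, hypothetical scores (most low, a few very high)
scores <- round(rgamma(5000, shape = 2, scale = 9))

mean(scores)
median(scores)

# Residuals from an intercept-only model are the scores minus their mean,
# so a negative median residual means the median lies below the mean
fit <- lm(scores ~ 1)
median(resid(fit))
```

A negative median residual, as in the model summaries above, is therefore a quick sign that the mean sits above the typical value.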

ggplot(Queens_Residuals, aes(x = factor(Cuisine), y = SCORE, colour = Cuisine)) +
  geom_boxplot(notch = TRUE) +
  labs(x = "Cuisines", y = "Score", title = "Scores of restaurants in Queens by Cuisines")

The boxplots for scores of different cuisines in Queens show that American and Chinese restaurants have many more outliers than the other cuisines. Given the right skew, these outliers lie at the high (poor) end of the scale and pull the averages upward. This suggests that a typical Chinese restaurant scores better than the cuisine's average score indicates. Nevertheless, American restaurants still have the lowest average score despite their many outliers, so in general they perform best in terms of inspection scores. The other cuisines have fewer outliers.
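The points drawn as outliers in a boxplot can also be listed numerically. The sketch below uses `boxplot.stats()` on simulated, hypothetical scores to show on which side of the median the flagged points fall.

```r
set.seed(4)
scores <- round(rgamma(2000, shape = 2, scale = 9))  # hypothetical scores

# boxplot.stats() applies the 1.5 * IQR whisker rule used by boxplots
stats <- boxplot.stats(scores)
outliers <- stats$out

# Where the flagged points sit relative to the median
sum(outliers > median(scores))  # flagged high scores
sum(outliers < median(scores))  # flagged low scores
```

For scores that cannot fall below 0 and have a long right tail, the flagged points are expected to be high scores only.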

t-tests on top 5 cuisines in Queens

Queens_American <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "QUEENS", Cuisine == "1") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0 (mu = 0 is the t.test() default)
t.test(Queens_American$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Queens_American$SCORE
## t = 173.1, df = 13637, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  16.94291 17.33103
## sample estimates:
## mean of x 
##  17.13697
Queens_Chinese <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "QUEENS", Cuisine == "2") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Queens_Chinese$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Queens_Chinese$SCORE
## t = 147.92, df = 11078, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.78957 20.32109
## sample estimates:
## mean of x 
##  20.05533
Queens_Italian <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "QUEENS", Cuisine == "3") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Queens_Italian$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Queens_Italian$SCORE
## t = 69.752, df = 2045, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  17.05073 18.03725
## sample estimates:
## mean of x 
##  17.54399
Queens_Japanese <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "QUEENS", Cuisine == "4") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Queens_Japanese$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Queens_Japanese$SCORE
## t = 68.172, df = 1976, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  16.91231 17.91420
## sample estimates:
## mean of x 
##  17.41325
Queens_Mexican <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "QUEENS", Cuisine == "5") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Queens_Mexican$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Queens_Mexican$SCORE
## t = 88.669, df = 2961, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  18.77364 19.62271
## sample estimates:
## mean of x 
##  19.19818

The t-tests reject the null hypothesis that the true mean score is 0 for every cuisine, with p-values below 2.2e-16 (far below the 0.05 threshold) and large t-values (ranging from approximately 68 to 173). That the mean scores differ significantly from 0 is expected, since every sample mean lies between roughly 17 and 20.
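The one-sample tests compare each mean with 0. If the question is instead whether two cuisines differ from each other, a two-sample test is one possible alternative. The sketch below is not part of the original analysis; it runs a Welch two-sample t-test on simulated, hypothetical scores.

```r
set.seed(5)
# Hypothetical samples standing in for the scores of two cuisines
cuisine_a <- rpois(3000, lambda = 18)
cuisine_b <- rpois(3000, lambda = 20)

# Welch two-sample t-test (the t.test() default when two samples are given)
cmp <- t.test(cuisine_a, cuisine_b, conf.level = 0.95)

cmp$estimate   # the two sample means
cmp$conf.int   # 95% CI for the difference in means (a minus b)
cmp$p.value
```

Here the confidence interval describes the difference between the two means, which speaks more directly to a comparison of cuisines than a test against 0.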

Brooklyn Data

Brooklyn_Cuisines <- Cuisines %>%
  filter(BORO == "BROOKLYN") %>%
  group_by(BORO, Cuisine) %>%
  # n() counts the rows in each borough/cuisine group
  summarise(Count = n(),
            AverageScore = mean(SCORE, na.rm = TRUE)) %>%
  # The result is still grouped by BORO, so sum(Count) is the borough total
  mutate(Percentage = Count / sum(Count) * 100)

Brooklyn_Cuisines$Cuisine <- factor(Brooklyn_Cuisines$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))
Brooklyn_Cuisines <- Brooklyn_Cuisines[order(Brooklyn_Cuisines$AverageScore) , ]
kable(Brooklyn_Cuisines, caption = "Brooklyn")
Brooklyn

BORO      Cuisine   Count  AverageScore  Percentage
BROOKLYN  American  17973  18.18839      44.702283
BROOKLYN  Italian   2731   18.27536      6.792518
BROOKLYN  Chinese   12494  20.11389      31.074964
BROOKLYN  Japanese  2828   20.28324      7.033776
BROOKLYN  Mexican   4180   20.67153      10.396458

The table shows that the most common cuisine in Brooklyn is American, accounting for approximately 45% of the borough's inspection records among the top five cuisines. The other cuisines follow, with Italian being the least common (about 6.79%). American cuisine also has the lowest average score (18.19), which means that it performs best on inspections among the five cuisines. The highest (worst) average score belongs to Mexican cuisine, at 20.67.

Brooklyn residuals

Brooklyn_Residuals <- Cuisines %>%
  filter(BORO == "BROOKLYN")

Brooklyn_Cuisines_mod <- lm(data = Brooklyn_Residuals, SCORE ~ Cuisine)
summary(Brooklyn_Cuisines_mod)
## 
## Call:
## lm(formula = SCORE ~ Cuisine, data = Brooklyn_Residuals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.895  -8.576  -4.156   5.424  91.424 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.99618    0.12255  146.85   <2e-16 ***
## Cuisine      0.57969    0.04992   11.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.15 on 40204 degrees of freedom
## Multiple R-squared:  0.003343,   Adjusted R-squared:  0.003318 
## F-statistic: 134.8 on 1 and 40204 DF,  p-value: < 2.2e-16

The linear model shows a weak positive association between Cuisine and Score. The slope is positive (0.58) and its p-value is far below 0.05, so it is statistically distinguishable from zero; however, both the multiple and adjusted R-squared values are about 0.003, which means that cuisine explains almost none of the variation in score. The slope should also be read with caution, because Cuisine enters the model as a numeric code from 1 to 5 whose ordering is arbitrary. Thus, in practical terms, there is almost no relationship between cuisine and score.

Brooklyn_Residuals <- Brooklyn_Residuals %>%
  dplyr::select(Cuisine, SCORE) %>%
  mutate(residual = resid(Brooklyn_Cuisines_mod))

Brooklyn_Residuals$Cuisine <- factor(Brooklyn_Residuals$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))

ggplot(Brooklyn_Residuals, aes(residual)) + 
  geom_histogram() +
  theme_tufte() +
  labs(x="Residuals", title = "Residuals of Brooklyn restaurants score")

The histogram of the residuals shows that there are outliers in the scores given to restaurants in Brooklyn. Much like the other boroughs', it is skewed to the right: a small number of very high scores pull the mean above the median (the median residual is about -4.2). This again implies that the average scores overstate the score of a typical restaurant.

ggplot(Brooklyn_Residuals,aes(x=factor(Cuisine),y=SCORE, colour = Cuisine)) +
  geom_boxplot(notch=TRUE) +
  labs(x="Cuisines", y ="Score", title = "Scores of restaurants in Brooklyn by Cuisines")

Similar to the previous boxplots, there are many outliers among the scores of American and Chinese restaurants. Although the average score for Chinese restaurants is among the highest, the numerous outliers suggest that the average is inflated by a minority of poorly scoring inspections, and that not all of the restaurants receive a bad score. The same applies to American restaurants, and since their average score is nevertheless the lowest, many restaurants of this cuisine in Brooklyn must score very well. In addition, Japanese and Mexican restaurants also have a number of outliers that may affect their average scores in the same way as for Chinese restaurants.

t-tests on top 5 cuisines in Brooklyn

Brooklyn_American <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "BROOKLYN", Cuisine == "1") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0 (mu = 0 is the t.test() default)
t.test(Brooklyn_American$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Brooklyn_American$SCORE
## t = 195.84, df = 17972, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  18.00635 18.37044
## sample estimates:
## mean of x 
##  18.18839
Brooklyn_Chinese <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "BROOKLYN", Cuisine == "2") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Brooklyn_Chinese$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Brooklyn_Chinese$SCORE
## t = 164.74, df = 12493, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.87457 20.35322
## sample estimates:
## mean of x 
##  20.11389
Brooklyn_Italian <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "BROOKLYN", Cuisine == "3") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Brooklyn_Italian$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Brooklyn_Italian$SCORE
## t = 83.441, df = 2730, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  17.84590 18.70482
## sample estimates:
## mean of x 
##  18.27536
Brooklyn_Japanese <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "BROOKLYN", Cuisine == "4") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Brooklyn_Japanese$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Brooklyn_Japanese$SCORE
## t = 74.3, df = 2827, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.74796 20.81852
## sample estimates:
## mean of x 
##  20.28324
Brooklyn_Mexican <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "BROOKLYN", Cuisine == "5") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Brooklyn_Mexican$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Brooklyn_Mexican$SCORE
## t = 92.686, df = 4179, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  20.23428 21.10878
## sample estimates:
## mean of x 
##  20.67153

The t-tests reject the null hypothesis that the true mean score is 0 for every cuisine, with p-values below 2.2e-16 (far below the 0.05 threshold) and large t-values (ranging from approximately 74 to 196). That the mean scores differ significantly from 0 is expected, since every sample mean lies between roughly 18 and 21. The 95 percent confidence intervals are narrow (between about 0.36 and 1.07 points wide), which indicates that the means are estimated precisely.

Bronx Data

Bronx_Cuisines <- Cuisines %>%
  filter(BORO == "BRONX") %>%
  group_by(BORO, Cuisine) %>%
  # n() counts the rows in each borough/cuisine group
  summarise(Count = n(),
            AverageScore = mean(SCORE, na.rm = TRUE)) %>%
  # The result is still grouped by BORO, so sum(Count) is the borough total
  mutate(Percentage = Count / sum(Count) * 100)

Bronx_Cuisines$Cuisine <- factor(Bronx_Cuisines$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))
Bronx_Cuisines <- Bronx_Cuisines[order(Bronx_Cuisines$AverageScore) , ]
kable(Bronx_Cuisines, caption = "Bronx")
Bronx

BORO   Cuisine   Count  AverageScore  Percentage
BRONX  Japanese  266    15.24812      2.141362
BRONX  Chinese   4116   16.10982      33.134761
BRONX  American  5430   17.69871      43.712768
BRONX  Italian   883    19.66931      7.108356
BRONX  Mexican   1727   20.00174      13.902753

The table shows that the most common cuisine in the Bronx is American, accounting for approximately 44% of the borough's inspection records among the top five cuisines. The other cuisines follow, with Japanese being the least common (about 2.14%). Japanese restaurants also have the lowest average score (15.25), which means that they perform best on inspections among the five cuisines. The highest (worst) average score belongs to Mexican cuisine, at 20.00.

Bronx residuals

Bronx_Residuals <- Cuisines %>%
  filter(BORO == "BRONX")

Bronx_Cuisines_mod <- lm(data = Bronx_Residuals, SCORE ~ Cuisine)
summary(Bronx_Cuisines_mod)
## 
## Call:
## lm(formula = SCORE ~ Cuisine, data = Bronx_Residuals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.173  -7.529  -3.980   4.923  81.020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.4320     0.1960  83.820  < 2e-16 ***
## Cuisine       0.5483     0.0786   6.976  3.2e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.87 on 12420 degrees of freedom
## Multiple R-squared:  0.003902,   Adjusted R-squared:  0.003822 
## F-statistic: 48.66 on 1 and 12420 DF,  p-value: 3.201e-12

The linear model shows a weak positive association between Cuisine and Score. The slope is positive (0.55) and its p-value is far below 0.05, so it is statistically distinguishable from zero; however, both the multiple and adjusted R-squared values are about 0.004, which means that cuisine explains almost none of the variation in score. One thing to note is that, of all the boroughs, the Bronx has the highest R-squared value, which means that the association there is slightly stronger. Even so, Cuisine enters the model as a numeric code from 1 to 5 whose ordering is arbitrary, and in practical terms there is almost no relationship between cuisine and score.

Bronx_Residuals <- Bronx_Residuals %>%
  dplyr::select(Cuisine, SCORE) %>%
  mutate(residual = resid(Bronx_Cuisines_mod))

Bronx_Residuals$Cuisine <- factor(Bronx_Residuals$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))

ggplot(Bronx_Residuals, aes(residual)) + 
  geom_histogram() +
  theme_tufte() +
  labs(x="Residuals", title = "Residuals of Bronx restaurants score")

The histogram of the residuals shows that there are outliers in the scores given to restaurants in the Bronx. Much like the other boroughs', it is skewed to the right: a small number of very high scores pull the mean above the median (the median residual is about -4). This further supports the earlier observation that the average scores in the tables overstate the score of a typical restaurant.

ggplot(Bronx_Residuals,aes(x=factor(Cuisine),y=SCORE, colour = Cuisine))+
  geom_boxplot(notch=TRUE) +
  labs(x="Cuisines", y ="Score", title = "Scores of restaurants in Bronx by Cuisines")

The boxplots for scores of different cuisines in the Bronx show that every cuisine except Japanese has some outliers. The number of outliers is much smaller than in the other boroughs, probably because the Bronx has fewer records than the other boroughs. Similar to the other boroughs, American and Chinese restaurants have the most outliers, which points to the same interpretation as in the previous boxplots. One thing to note in these boxplots is that there are no outliers among the scores of Japanese restaurants. This suggests that the scores of Japanese restaurants are consistently low, that is, that these restaurants perform well on inspections, although the sample is small (266 records).

t-tests on top 5 cuisines in Bronx

Bronx_American <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "BRONX", Cuisine == "1") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0 (mu = 0 is the t.test() default)
t.test(Bronx_American$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Bronx_American$SCORE
## t = 107.67, df = 5429, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  17.37645 18.02097
## sample estimates:
## mean of x 
##  17.69871
Bronx_Chinese <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "BRONX", Cuisine == "2") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Bronx_Chinese$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Bronx_Chinese$SCORE
## t = 99.308, df = 4115, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  15.79178 16.42786
## sample estimates:
## mean of x 
##  16.10982
Bronx_Italian <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "BRONX", Cuisine == "3") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Bronx_Italian$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Bronx_Italian$SCORE
## t = 43.139, df = 882, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  18.77443 20.56419
## sample estimates:
## mean of x 
##  19.66931
Bronx_Japanese <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "BRONX", Cuisine == "4") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Bronx_Japanese$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Bronx_Japanese$SCORE
## t = 33.029, df = 265, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  14.33913 16.15711
## sample estimates:
## mean of x 
##  15.24812
Bronx_Mexican <- Cuisines %>%
  dplyr::select(Cuisine, SCORE, BORO) %>%
  filter(BORO == "BRONX", Cuisine == "5") %>%
  mutate(AverageScore = mean(SCORE))

# One-sample t-test of H0: true mean score = 0
t.test(Bronx_Mexican$SCORE, mu = 0, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  Bronx_Mexican$SCORE
## t = 61.675, df = 1726, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.36566 20.63782
## sample estimates:
## mean of x 
##  20.00174

The t-tests reject the null hypothesis that the true mean score is 0 for every cuisine, with p-values below 2.2e-16 (far below the 0.05 threshold) and large t-values (ranging from approximately 33 to 108). That the mean scores differ significantly from 0 is expected, since every sample mean lies between roughly 15 and 20. The 95 percent confidence intervals for Italian and Japanese restaurants' scores (about 1.8 points wide) are wider than the others, which reflects their smaller samples (883 and 266 records) and means that their average scores are estimated less precisely.

Staten Island Data

StatenIsland_Cuisines <- Cuisines %>%
  filter(BORO == "STATEN ISLAND") %>%
  group_by(BORO, Cuisine) %>%
  # n() counts the rows in each borough/cuisine group
  summarise(Count = n(),
            AverageScore = mean(SCORE, na.rm = TRUE)) %>%
  # The result is still grouped by BORO, so sum(Count) is the borough total
  mutate(Percentage = Count / sum(Count) * 100)
StatenIsland_Cuisines$Cuisine <- factor(StatenIsland_Cuisines$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))
StatenIsland_Cuisines <- StatenIsland_Cuisines[order(StatenIsland_Cuisines$AverageScore) , ]
kable(StatenIsland_Cuisines, caption = "Staten Island")
Staten Island

BORO           Cuisine   Count  AverageScore  Percentage
STATEN ISLAND  Chinese   1174   18.49319      17.667419
STATEN ISLAND  American  3125   19.65568      47.027841
STATEN ISLAND  Italian   1245   20.04337      18.735892
STATEN ISLAND  Mexican   549    20.10929      8.261851
STATEN ISLAND  Japanese  552    20.41486      8.306998

The table shows that the most common cuisine in Staten Island is American, accounting for approximately 47% of the borough's inspection records among the top five cuisines. The other cuisines follow, with Mexican being the least common (about 8.26%). Chinese restaurants have the lowest average score (18.49), which means that they perform best on inspections among the five cuisines. The highest (worst) average score belongs to Japanese cuisine, at 20.41.

Staten Island residuals

StatenIsland_Residuals <- Cuisines %>%
  filter(BORO == "STATEN ISLAND")

StatenIsland_Cuisines_mod <- lm(data = StatenIsland_Residuals, SCORE ~ Cuisine)
summary(StatenIsland_Cuisines_mod)
## 
## Call:
## lm(formula = SCORE ~ Cuisine, data = StatenIsland_Residuals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.151  -8.416  -3.416   5.217  79.584 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.2318     0.2934  65.541   <2e-16 ***
## Cuisine       0.1838     0.1173   1.567    0.117    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.53 on 6643 degrees of freedom
## Multiple R-squared:  0.0003694,  Adjusted R-squared:  0.0002189 
## F-statistic: 2.455 on 1 and 6643 DF,  p-value: 0.1172

The linear model shows, at most, a very weak positive association between Cuisine and Score. The slope is positive (0.18), but its p-value (0.117) is above 0.05, so it is not statistically distinguishable from zero. Both the multiple and adjusted R-squared values are very low, and the linear model for Staten Island has the lowest R-squared values of the boroughs shown here. Thus, there is no evidence of a relationship between cuisine and score in Staten Island.

StatenIsland_Residuals <- StatenIsland_Residuals %>%
  dplyr::select(Cuisine, SCORE) %>%
  mutate(residual = resid(StatenIsland_Cuisines_mod))

StatenIsland_Residuals$Cuisine <- factor(StatenIsland_Residuals$Cuisine, levels = c(1, 2, 3, 4, 5),
                             labels = c("American", "Chinese", "Italian", "Japanese", "Mexican"))

ggplot(StatenIsland_Residuals, aes(residual)) + 
  geom_histogram() +
  theme_tufte() +
  labs(x="Residuals", title = "Residuals of Staten Island restaurants score")

The histogram of the residuals shows that there are outliers in the scores given to restaurants in Staten Island. Much like the other boroughs', it is skewed to the right: a small number of very high scores pull the mean above the median. However, compared to the other boroughs', Staten Island's histogram is slightly less skewed, which suggests that its average score is a somewhat more reliable summary.

ggplot(StatenIsland_Residuals,aes(x=Cuisine,y=SCORE, colour = Cuisine)) +
  geom_boxplot(notch=TRUE) +
  labs(x="Cuisines", y ="Score", title = "Scores of restaurants in Staten Island by Cuisines")

The boxplots for the scores of restaurants in Staten Island show that American restaurants have the most outliers; the other cuisines have noticeably fewer. Chinese restaurants in Staten Island, in general, do not have as many outliers as those in other boroughs. This, together with the fact that their average score is the lowest, indicates that most of the Chinese restaurants in this borough perform well on inspections.

t-tests on top 5 cuisines in Staten Island

## 
##  One Sample t-test
## 
## data:  StatenIsland_American$SCORE
## t = 84.593, df = 3124, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.20009 20.11127
## sample estimates:
## mean of x 
##  19.65568
## 
##  One Sample t-test
## 
## data:  StatenIsland_Chinese$SCORE
## t = 55.926, df = 1173, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  17.84441 19.14196
## sample estimates:
## mean of x 
##  18.49319
## 
##  One Sample t-test
## 
## data:  StatenIsland_Italian$SCORE
## t = 55.39, df = 1244, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.33345 20.75330
## sample estimates:
## mean of x 
##  20.04337
## 
##  One Sample t-test
## 
## data:  StatenIsland_Japanese$SCORE
## t = 36.691, df = 551, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.32192 21.50779
## sample estimates:
## mean of x 
##  20.41486
## 
##  One Sample t-test
## 
## data:  StatenIsland_Mexican$SCORE
## t = 42.931, df = 548, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.18919 21.02939
## sample estimates:
## mean of x 
##  20.10929

The t-tests reject the null hypothesis that the true mean score is 0 for every cuisine, with p-values below 2.2e-16 (far below the 0.05 threshold) and large t-values (ranging from approximately 37 to 85). That the mean scores differ significantly from 0 is expected, since every sample mean lies between roughly 18 and 20. The 95 percent confidence intervals are widest for Japanese and Mexican restaurants' scores (about 2.19 and 1.84 points wide), the two smallest samples, and narrowest for American restaurants (about 0.91 points wide). A wider interval means that the average score is estimated less precisely; it does not describe how far individual scores stray from the mean.
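The code that produced the Staten Island tests is not shown in the output. One compact way to run the same kind of test for every cuisine and to compare the interval widths is sketched below. It assumes a data frame with `Cuisine` and `SCORE` columns; the data frame `si_demo` is simulated and hypothetical, and the sketch is not the code that was actually used.

```r
set.seed(6)
# Hypothetical stand-in for the Staten Island subset
si_demo <- data.frame(
  Cuisine = factor(sample(c("American", "Chinese", "Italian", "Japanese", "Mexican"),
                          1000, replace = TRUE)),
  SCORE = rpois(1000, lambda = 19)
)

# One-sample t-test of H0: mean = 0 for each cuisine
tests <- lapply(split(si_demo$SCORE, si_demo$Cuisine),
                function(x) t.test(x, mu = 0, conf.level = 0.95))

# Collect sample size, mean and CI bounds in one table
ci_table <- do.call(rbind, lapply(names(tests), function(nm) {
  tt <- tests[[nm]]
  data.frame(Cuisine = nm,
             n = unname(tt$parameter) + 1,   # df + 1
             mean = unname(tt$estimate),
             lower = tt$conf.int[1],
             upper = tt$conf.int[2])
}))
ci_table$width <- ci_table$upper - ci_table$lower
ci_table
```

Putting the sample size next to the interval width makes it easier to see that the widest intervals belong to the smallest groups.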

Cuisines by Boroughs

Boro_ggplot <- rbind(Manhattan_Cuisines, Bronx_Cuisines, Brooklyn_Cuisines, StatenIsland_Cuisines, Queens_Cuisines)

Boro_ggplot$Percentage <- round(Boro_ggplot$Percentage, 3)

ggplot(Boro_ggplot, aes( x = BORO, y = Percentage, fill = Cuisine)) +
   geom_bar(position = position_stack(), stat = "identity", width = .7) +
  scale_fill_manual(values = c(American = "brown 1", Chinese = "gold", Italian = "light green",
                                Japanese = "sky blue", Mexican = "orange")) +
  geom_text(aes(label = Percentage), position = position_stack(vjust = 0.5), size = 2.5) +
   theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Percentages of each Cuisine in Different Boroughs", x = "Borough")

The bar plot shows the distribution of the top five cuisines in the different boroughs of New York City. American cuisine has the largest share in every borough. The second most common cuisine in the Bronx, Brooklyn, Queens, and Manhattan is Chinese, while in Staten Island it is Italian. Relative to the other boroughs, Chinese cuisine has its largest share in Queens, American and Japanese cuisines in Manhattan, and Italian cuisine in Staten Island.
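The shares behind the bar plot can be recomputed from the counts reported in the borough tables and t-test sample sizes. The sketch below builds a count matrix by hand (the matrix layout and object names are illustrative) and finds, for each cuisine, the borough where its share is largest.

```r
# Counts copied from the borough tables and t-test sample sizes above
counts <- rbind(
  MANHATTAN       = c(43622, 10526, 9599, 7917, 4874),
  QUEENS          = c(13638, 11079, 2046, 1977, 2962),
  BROOKLYN        = c(17973, 12494, 2731, 2828, 4180),
  BRONX           = c(5430, 4116, 883, 266, 1727),
  `STATEN ISLAND` = c(3125, 1174, 1245, 552, 549)
)
colnames(counts) <- c("American", "Chinese", "Italian", "Japanese", "Mexican")

# Row-wise shares in percent: each borough's row sums to about 100
shares <- round(prop.table(counts, margin = 1) * 100, 1)
shares

# Borough with the largest share for each cuisine
top_borough <- rownames(shares)[apply(shares, 2, which.max)]
names(top_borough) <- colnames(shares)
top_borough
```

Comparing shares rather than raw counts avoids favouring the boroughs that simply have more records.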

Line graph of 10F changes in past four years

InspDate_10F_Data <- Cuisines %>%
  dplyr::select(BORO, ViolationCode, InspectionDate) %>%
  filter(ViolationCode == "10F") %>%
  # InspectionDate is assumed to be MM/DD/YYYY, so characters 7-10 hold the year
  mutate(year = as.numeric(substr(InspectionDate, 7, 10)))

Year_10F_data <- InspDate_10F_Data %>%
  group_by(ViolationCode, year, BORO) %>%
  # Count the 10F records per year and borough
  summarise(count = n())
ggplot(data = Year_10F_data) +
  aes(x=year , y=count, color = BORO)+
  geom_line() +
  labs(x="Year", y = "Count of Violation code 10F", title = "Trends of Violation code 10F for all Cuisine in Different Boroughs", colour = "Borough") 

The line graph shows a steady increase in violation code 10F from 2013 to 2015, after which the number of these violations began to decrease in every borough, most sharply in Manhattan. We have not yet identified the reason behind this trend, which should be examined in future analysis.
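The yearly counts behind such a line graph can be sketched in base R as follows. The data frame `demo` is a small hypothetical example, and the dates are assumed to follow the MM/DD/YYYY layout implied by `substr(InspectionDate, 7, 10)`.

```r
# Hypothetical records; dates assumed to be MM/DD/YYYY strings
demo <- data.frame(
  BORO = c("QUEENS", "QUEENS", "BRONX", "BRONX", "BRONX"),
  ViolationCode = c("10F", "10F", "10F", "04L", "10F"),
  InspectionDate = c("03/14/2015", "07/02/2016", "11/20/2015",
                     "01/05/2016", "05/09/2015"),
  stringsAsFactors = FALSE
)

# Keep the 10F records and extract the year from characters 7 to 10
only_10f <- demo[demo$ViolationCode == "10F", ]
only_10f$year <- as.numeric(substr(only_10f$InspectionDate, 7, 10))

# Number of 10F records per borough and year
counts_10f <- as.data.frame(table(BORO = only_10f$BORO, year = only_10f$year),
                            stringsAsFactors = FALSE)
counts_10f

# Note: a code such as "10F" is a label, not a number; converting it gives NA,
# so counts have to come from counting rows
suppressWarnings(as.numeric("10F"))
```

Counting rows per group gives the number of violations directly, independent of how the violation code happens to be stored.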

Conclusion

In our analysis, we used several variables, including Score, Borough, Cuisine, InspectionDate, and ViolationCode. Score was used to conduct t-tests and demonstrate the trend over the years (the lower the score, the better the inspection result). Variables such as Cuisine and ViolationCode were used to find the distribution of cuisines and violation codes in each borough. The t-tests give confidence intervals for the mean score of each cuisine in each location, and the graphs and tables show how the average score differs between cuisines. Our tentative findings include the most popular cuisine in New York City (American) and the borough with the best score (Bronx). Through the statistical tests, linear regressions, and plots, we have also found that the average scores of boroughs and cuisines cannot be trusted completely, because the many outliers in the scores pull the averages upward. This means that, though the score is one of the clearest indicators of inspection performance, average scores alone should not be used to represent the overall quality of restaurants. In addition, scores do not measure how enjoyable or appetizing a restaurant's food is, which is a common misunderstanding among the public when it comes to grades or scores.

Thus, the main limitation of this analysis is its accuracy. Several assumptions can be made, but a firm conclusion has yet to be reached.

To sum up, the analysis has provided a detailed insight into the New York City Restaurant Inspection dataset. A lot of information has been obtained through the analysis, and most of it is useful; nevertheless, more research and analysis are needed to reach a complete conclusion. One direction for future analyses is to investigate the outliers in the score column of the dataset further. Another direction is to conduct more research on New York City's restaurant inspection policies in order to gain a deeper understanding of the scoring system and the trends of violation codes across the years.

Sources

Bloomberg, Michael R., and Thomas Farley. ‘Restaurant Grading in New York City at 18 Months.’ NYC Health, www1.nyc.gov/assets/doh/downloads/pdf/rii/restaurant-grading-18-month-report.pdf.

Dai, Serena. ‘Yes, the Number of Chain Restaurants Is Growing in NYC.’ Eater NY, 6 Nov. 2017, ny.eater.com/2017/11/6/16612300/chain-restaurant-growth-nyc-report.

‘The Inspection Process.’ NYC Health, www1.nyc.gov/site/doh/business/food-operators/the-inspection-process.page.

Acknowledgments

NYC OpenData

Geocodio